<!DOCTYPE html>

<html>

  <head>
    <title>Robotic Manipulation: Geometric Pose Estimation</title>
    <meta name="Robotic Manipulation: Geometric Pose Estimation" content="text/html; charset=utf-8;" />
    <link rel="canonical" href="http://manipulation.csail.mit.edu/pose.html" />

    <script src="https://hypothes.is/embed.js" async></script>
    <script type="text/javascript" src="htmlbook/book.js"></script>

    <script src="htmlbook/mathjax-config.js" defer></script> 
    <script type="text/javascript" id="MathJax-script" defer
      src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
    </script>
    <script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>

    <link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
    <script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
    <script>hljs.initHighlightingOnLoad();</script>

    <link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
  </head>

<body onload="loadChapter('manipulation');">

<div data-type="titlepage">
  <header>
    <h1><a href="index.html" style="text-decoration:none;">Robotic Manipulation</a></h1>
    <p data-type="subtitle">Perception, Planning, and Control</p> 
    <p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
    <p style="font-size: 14px; text-align: right;"> 
      &copy; Russ Tedrake, 2020<br/>
      Last modified <span id="last_modified"></span>.</br>
      <script>
      var d = new Date(document.lastModified);
      document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
      <a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
    </p>
  </header>
</div>

<p><b>Note:</b> These are working notes used for <a
href="http://manipulation.csail.mit.edu/Fall2020/">a course being taught
at MIT</a>. They will be updated throughout the Fall 2020 semester.  <!-- <a 
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture  videos are available on YouTube</a>.--></p> 

<table style="width:100%;"><tr style="width:100%">
  <td style="width:33%;text-align:left;"><a class="previous_chapter" href=pick.html>Previous Chapter</a></td>
  <td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
  <td style="width:33%;text-align:right;"><a class="next_chapter" href=clutter.html>Next Chapter</a></td>
</tr></table>


<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 3"><h1>Geometric Pose Estimation</h1>
  <a style="float:right; margin-top:-80px;" target="pose" href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/pose.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Corresponding Notebook In Colab"/></a>
  <div style="clear:right;"></div>

  <p>In the last chapter, we developed an initial solution to moving objects
  around, but we made one major assumption that would prevent us from using it
  on a real robot: we assumed that we knew the initial pose of the object.  This
  chapter is going to be our first pass at removing that assumption, by
  developing tools to estimate that pose using the information obtained from the
  robot's depth cameras.  The tools we develop here will be most useful when you
  are trying to manipulate <i>known objects</i> (e.g. you have a mesh file of
  their geometry) and are in a relatively <i>uncluttered</i> environment.  But
  they will also form a baseline for the more sophisticated methods we will
  study.</p>

  <p>The approach that we'll take here is very geometric.  This is in contrast
  to, and very complimentary with, approaches that are more fundamentally driven
  by data.  There is no question that deep learning has had an enormous impact
  in perception -- it's fair to say that it has enabled the current surge in
  manipulation advances -- and we will certainly cover it in these notes.  But
  when I've heard colleagues say that "all perception these days is based on
  deep learning", I can't help but cry foul.  There has been another surge of
  progress in the last few years that has been happening in parallel: the
  revolution in geometric reasoning, fueled by applications in autonomous
  driving and virtual/augmented reality in addition to manipulation.  Most
  advanced manipulation systems these days combine both "deep perception" and
  "geometric perception".</p>

  <section><h1>Cameras and depth sensors</h1>
  
    <p>Just as we had many choices when selecting the robot arm and hand, we
    have many choices for instrumenting our robot/environment with sensors. Even
    more so than our robot arms, the last few years have seen incredible
    improvements in the quality and reductions in cost and size for these
    sensors.  This is largely thanks to the cell phone industry, but the race
    for autonomous cars has been fueling high-end sensors as well.</p>

    <p>These changes in hardware quality have caused sometimes dramatic changes
    in our algorithmic approaches.  For example, estimation can be much easier
    when the resolution and frame rate of our sensors is high enough that not
    much can change in the world between two images; this undoubtedly
    contributed to the revolutions in the field of "simultaneous localization
    and mapping" (SLAM) we have seen over the last decade or so.</p>

    <p>One might think that the most important sensors for manipulation are the
    touch sensors (you might even be right!).  But in practice today, most of
    the emphasis is on camera-based and/or range sensing.  At very least, we
    should consider this first, since our touch sensors won't do us much good if
    we don't know where in the world we need to touch.</p>

    <p>Traditional cameras, which we think of as a sensor that outputs a color
    image at some framerate, play an important role.  But robotics makes heavy
    use of sensors that make an explicit measurement of the distance (between
    the camera and the world) or depth; sometimes in addition to color and
    sometimes in lieu of color.  Admittedly, some researchers think we should
    only rely on color images.</p>

    <subsection><h1>Depth sensors</h1>
    
      <p>Primarily RGB-D (ToF vs projected texture stereo vs ...) cameras and Lidar</p>

      <p>The cameras we are using in this course are <a href="https://software.intel.com/en-us/realsense/d400">Intel RealSense D415</a>....</p>

      <p>Monocular depth.</p>

      <todo>How the kinect works in 2 minutes: https://www.youtube.com/watch?v=uq9SEJxZiUg</todo>

    </subsection>

    <subsection><h1>Simulation</h1>
    
      <p>There are a number of levels of fidelity at which one can simulate a
      camera like the D415.  We'll start our discussion here using an "ideal"
      RGB-D camera simulation -- the pixels returned in the depth image
      represent the true geometric depth in the direction of each pixel
      coordinate.  In <drake></drake> that is represented by the
      <code>RgbdSensor</code> system, which can be wired up directly to the
      <code>SceneGraph</code>.</p>

      <div>
        <script src="htmlbook/js-yaml.min.js"></script>
        <script type="text/javascript">
        var sys = jsyaml.load(`
name: RgbdSensor
input_ports:
- geometry_query
output_ports:
- color_image
- depth_image_32f
- depth_image_16u
- label_image
- X_WB`);
        document.write(system_html(sys, "https://drake.mit.edu/doxygen_cxx/classdrake_1_1systems_1_1sensors_1_1_rgbd_sensor.html"));
        </script>
      </div>

      <p>The signals and systems abstraction here is encapsulating a lot of
      complexity.  Inside the implementation of that system is a complete
      rendering engine, like one that you will find in high-end computer games.
      Drake actually supports multiple rendering engines; for the purposes of
      this class we will primarily use an OpenGL-based renderer that is suitable
      for real-time simulation.  In our research we also use Drake's rendering
      API to connect to high-end <a
      href="https://en.wikipedia.org/wiki/Physically_based_rendering">physically-based
      rendering (PBR)</a> based on ray tracing, such as the <a
      href="https://www.blender.org/features/rendering/">Cycles renderer
      provided by Blender</a>.  These are most useful when we are trying to
      render higher quality images e.g. to <i>train</i> a deep perception
      system; we will discuss them in the deep perception chapters.</p>

      <example class="drake"><h1>Simulating an RGB-D camera</h1>

        <p><a target="pose"
          href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/pose.ipynb#scrollTo=7q0A14bAilIX"><img
          src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
          </p>

        <p>As a simple example of depth cameras in drake, I've constructed a
        scene with a single object (the mustard bottle from the <a
        href="https://github.com/RobotLocomotion/models/tree/master/ycb">YCB
        dataset</a>), and added an <code>RgbdSensor</code> to the diagram.  Once
        this is wired up, we can simply evaluate the output ports in order to
        obtain the color and depth images:</p>

        <figure><img src="data/mustard.svg"/></figure>
    
        <p><img style="height:200px;float:right;margin-left:10px"
        src="data/mustard_meshcat.png"/> Please make sure you spend a minute
        with the MeshCat visualization (<a href="data/mustard.html">available
        here</a>).  You'll see a camera and the camera frame, and you'll see the
        mustard bottle as always... but it might look a little funny.  That's
        because I'm displaying both the ground truth mustard bottle model
        <i>and</i> the point cloud rendered from the cameras.  You can use the
        MeshCat GUI to uncheck the ground truth visualization (image right), and you'll be able to see just the point cloud.</p>

        <p>Remember that, in addition to looking at the source code if you like,
        you can always inspect the block diagram to understand what is happening
        at the "systems level".  <a href="data/depth_camera_diagram.svg">Here is the diagram</a> used in this example.</p>

      </example>
    
      <p>In the <code>ManipulationStation</code> simulation of the entire robot
      system for class, the <code>RgbdSensors</code> have already been added,
      and their output ports exposed as outputs for the station.</p>

      <subsection><h1>Sensor noise and depth dropouts</h1>

        <p>Real depth sensors, of course, are far from ideal -- and errors in
        depth returns are not simple Gaussian noise, but rather are dependent on
        the lighting conditions, the surface normals of the object, and the
        visual material properties of the object, among other things.  The color
        channels are an approximation, too.  Even our high-end renderers can
        only do as well as the geometries, materials and lighting sources that
        we add to our scene, and it is very hard to capture all of the nuances
        of a real environment.  We will examine real sensor data, and the gaps
        between modeled sensors and real sensors, after we start understanding
        some of the basic operations.</p>

        <figure><img style="width:95%"
        src="data/Sweeney18a_fig3.png"/><figcaption>Real depth sensors are
        messy.  The red pixels in the far right image are missed returns -- the
        depth sensor returned "maximum depth".  Missed returns, especially on
        the edges of objects (or on reflective objects) are a common sighting in
        raw depth data.  This figure is reproduced from
        <elib>Sweeney18a</elib>.</figcaption></figure>

      </subsection>
      <subsection><h1>Occlusions and partial views</h1>
    
      <p>The view of the mustard bottle in the example above makes one primary
      challenge of working with cameras very clear. Cameras don't get to see
      everything!  They only get line of sight.  If you want to get points from
      the back of the mustard bottle, you'll need to move the camera.  And if
      the mustard bottle is sitting on the table, you'll have to pick it up if
      you ever want to see the bottom.  This situation gets even worse in
      cluttered scenes, where we have our views of the bottle <i>occluded</i> by
      other objects.  Even when the scene is not cluttered, a head-mounted or
      table-mounted camera is often blocked by the robot's hand when it goes to
      do manipulation -- the moments when you would like the sensors to be at
      their best is often when they give no information!</p>

      <p>It is quite common to mount a depth camera on the robot's wrist in
      order to help with the partial views.  But it doesn't completely resolve
      the problem.  All of our depth sensors have both a maximum range and a
      minimum range.  The RealSense D415 returns matches from about 0.3m to
      1.5m.  That means for the last 30cm of the approach to the object --
      again, where we might like our cameras to be at their best -- the object
      effectively disappears!</p>

      <figure><img style="width:80%" src="data/dishloading_cameras.svg">
      <figcaption>When luxury permits, we try to put as many cameras as we can
      into the robot's environment.  In this dish-loading testbed at <a
      href="https://www.tri.global/news/tri-taking-on-the-hard-problems-in-manipulation-re-2019-6-27">TRI</a>
      where we put a total of eight cameras into the scene (it still doesn't
      resolve the occlusion problem).</figcaption></figure>

    </subsection>
  </section>

  <section><h1>Representations for geometry</h1>

    <p>Depth image, point cloud, triangulated mesh, signed distance functions, (+ surfels, occupancy maps, ...)</p>

    <p>The data returned by a depth camera takes the form of an image, where
    each pixel value is a single number that represents the distance between the
    camera and the nearest object in the environment along the pixel direction.
    If we combine this with the basic information about the camera's intrinsic
    parameters (e.g. lens parameters, stored in the <a
    href="https://drake.mit.edu/doxygen_cxx/classdrake_1_1systems_1_1sensors_1_1_camera_info.html"><code>CameraInfo</code></a>
    class in Drake) then we can transform this <i>depth image representation</i>
    into a collection of 3D points, $s_i$.  I use $s$ here because they are
    commonly referred to as the "scene points" in the algorithms we will present
    below.  These points have a pose and (optionally) a color value or other
    information attached.  This is known as a <i>point cloud representation</i>.
    By default, their pose is described in the camera frame, $^CX^{s_i}$, but
    the <code>DepthImageToPointCloud</code> system in Drake also accepts $^PX^C$
    to make it easy to transform them into another frame (such as the world
    frame).</p>  
    
    <div>
      <script src="htmlbook/js-yaml.min.js"></script>
      <script type="text/javascript">
      var sys = jsyaml.load(`
name: DepthImageToPointCloud
input_ports:
- depth_image
- color_image (optional)
- camera_pose (optional)
output_ports:
- point_cloud`);
      document.write(system_html(sys, "http://drake.mit.edu/doxygen_cxx/classdrake_1_1perception_1_1_depth_image_to_point_cloud.html"));
      </script>
    </div>
    
    <p>As depth sensors have become so pervasive the field has built up
    libraries of tools for performing basic geometric operations on point
    clouds, and that can be used to transform back and forth between
    representations.  We can implement many of of the basic operations directly
    in Drake, but we will use the <a href="http://www.open3d.org/">Open3D</a>
    library if we need more.  Many older projects used the <a
    href="http://pointclouds.org/">Point Cloud Library (PCL)</a>, which is now
    defunct but still has some very useful documentation.</p>

    <p>It's important to realize that the conversion of a depth image into a
    point cloud does lose some information -- specifically the information
    about the ray that was cast from the camera to arrive at that point.  In
    addition to declaring "there is geometry at this point", the depth image
    combined with the camera pose also implies that "there is no geometry in the
    straight line path between camera and this point".  We will make use of this
    information in some of our algorithms, so don't discard the depth images
    completely!  More generally, we will find that each of the different
    geometry representations have strengths and weaknesses -- it is very common
    to keep multiple representations around and to convert back and forth
    between them.</p>

  </section>

  <!--
  <section><h1>Working with point clouds</h1>

    <p>First, let us get a few simulated point clouds to work with.  We can use either <drake></drake> or Open3d to convert the depth image into a point cloud; we'll use Drake for this step here simply because we already have the camera parameters in the Drake format.  Although we can call the conversion functions directly, we can also add a system that does the conversion for us in the systems framework:</p>

    <table align="center" cellpadding="0" cellspacing="0">
      <tr align="center">
      <td><table cellspacing="0" cellpadding="0">
      <tr>
      <td align="right" style="padding:5px 0px 5px 0px">depth_image &rarr; </td></tr>
      <tr>
      <td align="right" style="padding:5px 0px 5px 0px">color_image (optional) &rarr; </td></tr>
      <tr>
      <td align="right" style="padding:5px 0px 5px 0px">camera_pose (optional) &rarr;</td></tr>
      </table>
      </td><td align="center" style="border:solid;padding-left:20px;padding-right:20px" bgcolor="#F0F0F0"><a class="el" href="http://drake.mit.edu/doxygen_cxx/classdrake_1_1perception_1_1_depth_image_to_point_cloud.html" title="Converts a depth image to a point cloud.">DepthImageToPointCloud</a></td><td><table cellspacing="0" cellpadding="0">
      <tr>
      <td align="left" style="padding:5px 0px 5px 0px">&rarr; point_cloud </td></tr>
      </table>
      </td></tr>
    </table>

    <p>You can infer from the optional input ports on this system that point cloud representations can potentially include color values for each of the Cartesian points.  The following example will get us a few point clouds to work with.</p>

    <todo>Combine all point clouds via https://drake.mit.edu/pydrake/pydrake.systems.perception.html?highlight=pointcloud#pydrake.systems.perception.PointCloudConcatenation</todo>

    <todo>Show the point cloud in Open3D's visualizer</todo>

    There are a number of basic operations that one can do on point clouds.  I'll give just a few examples here.  

    <subsection><h1>Surface normal estimation</h1>
    
      <p>The simplest and most straight-forward algorithm for estimating the normals of a point cloud surface is to find the $k$ nearest neighbors, and fit a plane through those points using least-squares.  Let's denote the nominal point (about which we are estimating the normal) $p_0$, and it's $k$ nearest neighboard $p_i, i\in[1,k].$  Let us form the $k \times 3$ data matrix: $$X = [ p_1 - p_0, p_2 - p_0, ..., p_k - p_0 ].$$  Then the principal components of this data matrix are given by the <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition</a> of $X$ (or the eigenvectors of $X^T X$); if $X = U\Sigma V^T$, then the columns of $V$ are the principal directions, with the last column (associated with the smallest singular value) representing the normal direction.  The length can be normalized to one; but keep in mind that the sign of this vector is ambiguous -- the $k$ nearest points don't give any information about what direction might be into the object or out of the object.  This is a case where knowing the original camera direction can help -- for example, PCL offers a method <code>flipNormalTowardsViewpoint</code> that post-processes the normals.</p>

      <figure>
        <img src="http://www.pointclouds.org/assets/images/contents/documentation/features_normal.png" />

        <figcaption>Normal estimation using $k$ nearest neighbords (image linked from the <a href="http://docs.pointclouds.org/trunk/group__features.html"></a>PCL documentation</a>)</figcaption>
      </figure>

      <todo>Add an example; or even support normals in the DepthImageToPointCloud.  Example could be as simple as grabbing one point from the mustard point cloud, and finding it's normal (with everything plotted in matplotlib).</todo>

      <p>What was interesting and surprising (to me) about this is not that the least-squares solution works.  What is surprising is that this operation is considered commonplace and cheap -- for every point in the point cloud, we find $k$ nearest neighbors and do the normal estimation... even for very large point clouds.  Making this performant typically requires data structures like the <a href="https://en.wikipedia.org/wiki/K-d_tree">k-d tree</a> to make the nearest neighbor queries efficient.</p>

      <p>Another standard feature that can be extracted from the $k$ nearest neighbors is the local curvature.  The math is quite similar.  <todo>add it here</todo></p>

    </subsection>

    <subsection><h1>Plane segmentation</h1>
    
      <p>Moving beyond local properties like surface normals and local curvature, we might choose to process a point cloud into primitive shapes -- like boxes or spheres.  Another practical and natural segmentation is segmenting surfaces into planes... which we can accomplish with heuristic algorithms which group neighboring points with similar surface normals.  We used this effectively in the DARPA Robotics Challenge.</p>

      <todo>Insert DRC plane segmentation videos here, from https://drive.google.com/drive/folders/1gYMJ0djBXbevWDBpekkK58pcbtZFr0A0 </todo>

    </subsection>

  </section>
  -->

  <section><h1>Point cloud registration with known correspondences</h1>
    
    <p>Let us begin to address the primary objective of the chapter -- we have a
    known object somewhere in the robot's workspace, we've obtained a point
    cloud from our depth cameras.  How do we estimate the pose of the object,
    $X^O$?</p>

    <p><img style="height:90px;float:left;margin-right:10px"
    src="data/icp_model_points_sketch.png"/>One very natural and well-studied
    formulation of this problem comes from the literature on <a
    href="https://en.wikipedia.org/wiki/Point_set_registration">point cloud
    registration</a> (also known as point set registration or scan matching).
    Given two point clouds, point cloud registration aims to find a rigid
    transform that optimally aligns the two point clouds.  For our purposes,
    that suggests that our "model" for the object should also take the form of a
    point cloud (at least for now).  We'll describe that object with a list of
    <i>model points</i>, $m_i$, with their pose described in the object frame:
    $^OX^{m_i}.$ </p>

    <p>Our second point cloud will be the <i>scene points</i>, $s_i$, that we
    obtain from our depth camera, transformed (via the camera intrinsics) into
    the camera coordinates, $^CX^{s_i}$.  I'll use $N_m$ and $N_s$ for the
    number of model and scene points, respectively.</p>  
      
    <p>Let us assume, for now, that we also know $X^C.$  Is this reasonable? In
    the case of a wrist-mounted camera, this would amount to solving the forward
    kinematics of the arm.  In an environment-mounted camera, this is about
    knowing the pose of the cameras in the world.  Small errors in estimating
    those poses, though, can lead to large artifacts in the estimated point
    clouds.  We therefore take great care to perform <a
    href="station.html">camera extrinsics calibration</a>; I've linked to our
    calibration code in the <a href="station.html">appendix</a>.  Note that if
    you have a mobile manipulation platform (an arm mounted on a moving base),
    then all of these discussions still apply, but you likely will perform all
    of your registration in the robot's frame instead of the world frame.</p>

    <p>The second <b>major assumption</b> that we will make in this section is
    "known correspondences".  This is <i>not</i> a reasonable assumption in
    practice, but it helps us get started.  The assumption is that we know which
    scene point corresponds with which model point.  I'll use a correspondence
    vector $c \in [1,N_m]^{N_s}$, where $c_i = j$ denotes that scene point $s_i$
    corresponds with model point $m_j$.  Note that we do not assume that these
    correspondences are one-to-one.  Every scene point corresponds to some model
    point, but the converse is not true, and multiple scene points may
    correspond to the same model point.</p>

    <p>As a result, the point cloud registration problem is simply an (inverse)
    kinematics problem.  We can write the model points and the scene points in a
    common frame (here the world frame), $$p^{m_{c_i}} = X^O \: {}^Op^{m_{c_i}}
    = X^C \: {}^Cp^{s_i},$$ leaving a single unknown transform, $X^O$, for which
    we must solve.  In the previous chapter I argued for using differential
    kinematics instead of inverse kinematics; why is my story different here?
    Differential kinematics can still be useful for perception, for example if
    we want to track the pose of an object after we acquire its initial pose.
    But unlike the robot case, where we can read the joint sensors to get our
    current state, in perception we need to solve this harder problem at least
    once.</p>

    <p>What does a solution for $X^O$ look like?  Each model point gives us
    three constraints, when $p^{s_i} \in \Re^3.$  The exact number of unknowns
    depends on the particular representation we choose for the pose, but almost
    always this problem is dramatically over-constrained.  Treating each point
    as hard constraints on the relative pose would also be very susceptible to
    measurement noise.  As a result, it will serve us better to try to find a
    pose that describes the data, e.g., in a least-squares sense: $$\min_{X^O
    \in \mathrm{SE}(3)} \sum_{i=1}^{N_s} \| X^O \: {}^Op^{m_{c_i}} - X^C \:
    {}^Cp^{s_i}\|^2.$$  Here I have used the notation $\mathrm{SE}(3)$ for the
    "<a href="http://planning.cs.uiuc.edu/node147.html">special Euclidean
    group</a>," which is the standard notation saying the $X^O$ must be a valid
    rigid transform.</p>

    <p>To proceed, let's pick a particular representation for the pose to work
    with.  I will use 3x3 rotation matrices here; the approach I'll describe
    below also has an equivalent in quaternion form<elib>Horn87</elib>.  To
    write the optimization above using the coefficients of the rotation matrix
    and the translation as decision variables, I have: \begin{align} \min_{p^O,
    R^O} \quad& \sum_{i=1}^{N_s} \| p^O + R^O \: {}^Op^{m_{c_i}} - X^C
    \: {}^Cp^{s_i} \|^2, \\ \subjto \quad& R^T = R^{-1}, \quad \det(R) = +1.
    \nonumber \end{align}  The constraints are needed because not every 3x3
    matrix is a valid rotation matrix; satisfying these constraints ensures that
    $R$ <i>is</i> a valid rotation matrix.  In fact, the constraint $R^T =
    R^{-1}$ is almost enough by itself; it implies that $\det(R) = \pm 1.$ But a
    matrix that satisfies that constraint with $\det(R) = -1$ is called an "<a
    href="https://en.wikipedia.org/wiki/Improper_rotation">improper
    rotation</a>", or a rotation plus a reflection across the axis of
    rotation.</p>

    <p>Let's think about the type of optimization problem we've formulated so
    far.  Given our decision to use rotation matrices, the term $p^O +
    R^O{^Op^{m_{c_i}}} - X^C {^Cp^{s_i}}$ is linear in the decision variables
    (think $Ax \approx b$), making the objective a convex quadratic.  How about
    the constraints?  The first constraint is a quadratic equality constraint
    (to see it, rewrite it as $RR^T=I$ which gives 9 constraints, each quadratic
    in the elements of $R$).  The determinant constraint is cubic in the
    elements of $R$.</p>

    <example id="2D_rotation_only"><h1>Rotation-only point registration in 2D</h1>
        
      <p>I would like you to have a graphical sense for this optimization
      problem, but it's hard to plot the objective function with 9+3 decision
      variables.  To whittle it down to the essentials that I can plot, let's
      consider the case where the scene points are known to differ from the
      model points by a rotation only (this is famously known as <a
      href="https://en.wikipedia.org/wiki/Wahba%27s_problem">the Wahba
      problem</a>).  To further simplify, let's consider the problem in 2D
      instead of 3D.</p>

      <p>In 2D, we can actually linearly parameterize a rotation matrix with
      just two variables:  $$R = \begin{bmatrix} a & -b \\ b & a\end{bmatrix}.$$
      With this parameterization, the constraints $RR^T=I$ and the determinant
      constraints are identical; they both require that $a^2 + b^2 = 1.$</p>

      <figure><img style="width:80%"
      src="data/wahba2d.png"/></figure><figcaption>The $x$ and $y$ axes of this
      plot correspond to the parameters $a$, and $b$, respectively, that define
      the 2D rotation matrix, $R$.  The green quadratic bowl is the objective
      $\sum_i \| R \: {}^O p^{m_{c_i}} - p^{s_i}\|^2,$ and the red open cylinder is
      the rotation matrix constraint.  <a
      href="https://www.geogebra.org/3d/pn628maw">Click here for the interactive
      version.</a></figcaption>

      <p>Is that what you expected?  I generated this plot using just two
      model/scene points, but adding more will only change the shape of the
      quadratic, but not its minimum.  And on the <a
      href="https://www.geogebra.org/3d/pn628maw">interactive version</a> of the
      plot, I've added a slider so that you control the parameter, $\theta$,
      that represents the ground truth rotation described by $R$. Try it
      out!</p>

      <p>In the case with no noise in the measurements (e.g. the scene points
      are exactly the model points modified by a rotation matrix), then the
      minimum of the objective function already satisfies the constraint.  But
      if you add just a little noise to the points, then this is no longer true,
      and the constraint starts to play an important role.</p>
    </example>

    <p>The geometry should be roughly the same in 3D, though clearly in much
    higher dimensions.  But I hope that the plot makes it perhaps a little less
    surprising that this problem has an elegant numerical solution based on the
    singular value decomposition (SVD).</p>
    
    <p>What about translations?  There is a super important insight that allows
    us to decouple the optimization of rotations from the optimization of
    translations.  The insight is this: <i>the <b>relative</b> position between
    points is affected by rotation, but not by translation</i>. Therefore, if we
    write the points relative to some canonical point on the object, we can
    solve for rotations alone.  Once we have the rotation, then we can back out
    the translations easily.  For our least squares objective, there is even a
    "correct" canonical point to use -- the <i>central point</i> (like the
    center of mass if all points have equal mass) under the Euclidean
    metric.</p>

    <p>Therefore, to obtain the solution to the problem, \begin{align}
    \min_{p\in\Re^3,R\in\Re^{3\times3}} \quad & \sum_{i=1}^{N_s} \| p + R
    \: {}^Op^{m_{c_i}} - p^{s_i}\|^2, \\ \subjto \quad & RR^T =
    I,\end{align} first we define the central model and scene points, $\bar{m}$
    and $\bar{s}$, with positions given by $${}^Op^\bar{m} = \frac{1}{N_s}
    \sum_{i=1}^{N_s} {}^Op^{m_{c_i}}, \qquad p^\bar{s} = \frac{1}{N_s}
    \sum_{i=1}^{N_s} p^{s_i}.$$  If you are curious, these come from taking the
    gradient of the <a
    href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrangian</a>
    with respect to $p$, and setting it equal to zero: $$\sum_{i=1}^{N_s} 2 (p +
    R \: {}^Op^{m_{c_i}} - p^{s_i}) = 0 \Rightarrow p^* = \frac{1}{N_s}
    \sum_{i=1}^{N_s} p^{s_i} - R^*\left(\frac{1}{N_s} \sum_{i=1}^{N_s}
    {}^Op^{m_{c_i}}\right),$$ but don't forget the geometric interpretation
    above.  This is just another way to see that we can substitute the
    right-hand side in for $p$ in our objective and solve for $R$ by itself.</p>
    
      
    <p>To find $R$, compose the data matrix $$W = \sum_{i=1}^{N_s} (p^{s_i} -
    p^\bar{s}) ({}^Op^{m_{c_i}} - {}^Op^\bar{m})^T, $$ and use SVD to compute $W
    = U \Sigma V^T$.  The optimal solution is \begin{gather*} R^* = U D V^T,\\
    p^* = p^\bar{s} - R^* {}^Op^\bar{m},\end{gather*} where $D$ is the diagonal
    matrix with entries $[1, 1, \det(UV^T)]$ <elib>Myronenko09</elib>.  This may
    seem like magic, but replacing the singular values in SVD with ones gives
    the optimal projection back onto the orthonormal matrices, and the diagonal
    matrix ensures that we do not get an improper rotation. There many
    derivations available in the literature, but many of them drop the
    determinant constraint instead of handling the improper rotation. See
    <elib>Haralick89</elib> (section 3) for one of my early favorites.</p>  
      
    <example class="drake"><h1>Point cloud registration with known
    correspondences</h1>

      <figure><img style="width:70%; margin-top:-20px; margin-bottom:-20px"
      src="data/pose_from_known_correspondences.svg"/> </figure>

      <p>In the example code, I've made a random object (based on a random set
      of points in the $x$-$y$ plane), and perturbed it by a random 2D
      transform.  The blue points are the model points in the model coordinates,
      and the red points are the scene points.  The green dashed lines represent
      the (perfect) correspondences.  On the right, I've plotted both points
      again, but this time using the estimated pose to put them both in the
      world frame.  As expected, the algorithm perfectly recovers the ground
      truth transformation.</p>

      <p><a target="pose"
      href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/pose.ipynb#scrollTo=9QDyRMCyb_gL"><img
      src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
      In Colab"/></a>
          </p>

    </example>

    <todo>Add exercise on the reflection case.</todo>

    <todo>Confirm or disprove: i think that this is just the least squares
    solution to the unconstrained problem, then projected back to the
    orthonormal matrices (with the SVD).</todo>

    <p>What is important to understand is that once the correspondences are
    given we have an efficient and robust numerical solution to estimating the
    pose.</p>

  </section>

  <section><h1>Iterative Closest Point (ICP)</h1>

    <p>So how do we get the correspondences?  In fact, if we were given the pose
    of the object, then figuring out the correspondences is actually easy: the
    corresponding point on the model is just the nearest neighbor / closest
    point to the scene point when they are transformed into a common frame.</p>

    <p>This observation suggests a natural iterative scheme, where we start with
    some initial guess of the object pose and compute the correspondences via
    closest points, then use those correspondences to update the estimated pose.
    This is one of the famous and often used (despite its well-known
    limitations) algorithms for point cloud registration: the iterative closest
    point (ICP) algorithm.</p>

    <p>To be precise, let's use $\hat{X}^O$ to denote our estimate of the object
    pose, and $\hat{c}$ to denote our estimated correspondences.  The
    "closest-point" step is given by $$\forall i, \hat{c}_i = \argmin_{j \in
    N_m}\| \hat{X}^O \: {}^Op^{m_{c_i}} - p^{s_i} \|^2.$$  In words, we want to
    find the point in the model that is the closest in Euclidean distance to the
    transformed scene points.  This is the famous "nearest neighbor" problem,
    and we have good numerical solutions (using optimized data structures) for
    it.  For instance, Open3D uses <a
    href="https://github.com/mariusmuja/flann">FLANN</a><elib>Muja09</elib>.</p>

    <p>Although solving for the pose and the correspondences jointly is very
    difficult, ICP leverages the idea that if we solve for them independently,
    then both parts have good solutions.  Iterative algorithms like this are a
    common approach taken in optimization for e.g. bilinear optimization or
    expectation maximization.  It is important to understand that this is a
    local solution to a non-convex optimization problem.  So it is subject to
    getting stuck in local minima.</p>

    <example class="drake"><h1>Iterative Closest Point</h1>
  
      <figure>
        <iframe style="border:0;height:300px;width:400px;" src="data/iterative_closest_point.html?height=240px"></iframe>
      </figure>
  
      <p>Here is ICP running on the random 2D objects.  Blue are the model
      points, red are the scene points, green dashed lines are the
      correspondences.  I encourage you to run the code yourself.</p>  
            
      <p>I've included one of the animation where it happens to converge to the
      true optimal solution.  But it gets stuck in a local minima more often
      than not!  I hope that stepping through will help your intuition.
      Remember that once the correspondences are correct, the pose estimation
      will recover the exact solution.  So every one of those steps before
      convergence has at least a few bad correspondences.  Look closely!</p>
  
      <p><a target="pose"
      href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/pose.ipynb#scrollTo=AHfxMwrvb1mz"><img
      src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
      In Colab"/></a></p>
  
    </example>      
      
    <!--<todo>Thanks to the geometric nature of the problem, many of these local
      minima are graphical and intuitive.  (Will insert a few examples
      here).</todo> -->

    <p>Intuition about these local minima has motivated a number of ICP
    variants, including point-to-plane ICP, normal ICP, ICP that use color
    information, feature-based ICP, etc.</p>

 <!---     
      <p>Maybe more fundamental, though, is the fact that I don't think the ICP objective, even if we could optimize it perfectly is quite right.  (TODO: add the example of a long thin book).  And there is information that we have, which should aid in estimating the pose, which is not present in the point cloud (like the location of the cameras).</p>
-->

  </section>

  <!--
  <section><h1>Pairwise distance</h1>
    
  </section>
-->

  <section><h1>Dealing with partial views and outliers</h1>

    <p>The example above is sufficient to demonstrate the problems that ICP can
    have with local minima.  But we have so far been dealing with unreasonably
    perfect point clouds.  Point cloud registration is an important topic in
    many fields, and not all of the approaches you will find in the literature
    are well suited to dealing with the messy point clouds we have to deal with
    in robotics.</p>

    <example class="drake"><h1>ICP with messy point clouds</h1>
  
      <p>I've added a number of parameters for you to play with in the notebook
      to add outliers (scene points that do not actually correspond to any model
      point), to add noise, and/or to restrict the points to a partial view.
      Please try them by themselves and in combination.</p>

      <!-- todo: update the hash -->
      <p><a target="pose"
        href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/pose.ipynb#scrollTo=AHfxMwrvb1mz"><img
        src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
        In Colab"/></a></p>
  
      <figure style="margin-top:-40px;">
        <iframe style="border:0;height:300px;width:400px;" src="data/icp_outliers.html?height=240px"></iframe>
        <figcaption>A sample run of ICP with outliers (modeled as additional
        points drawn from a uniform distribution over the
        workspace).</figcaption>
      </figure>
  
      <figure>
        <iframe style="border:0;height:300px;width:400px;" src="data/icp_noise.html?height=240px"></iframe>
        <figcaption>A sample run of ICP with Gaussian noise added to the positions of the scene points.</figcaption>
      </figure>

      <figure>
        <iframe style="border:0;height:300px;width:400px;" src="data/icp_partial_view.html?height=240px"></iframe>
        <figcaption>A sample run of ICP with scene points limited to a partial view.  They aren't all this bad, but when I saw that one, I couldn't resist using it as the example!</figcaption>
      </figure>
  
    </example>      

    <subsection><h1>Detecting outliers</h1>
    
      <p>Once we have a reasonable estimate, $X^O$, then detecting outliers is
      straight-forward.  Each scene point contributes a term to the objective
      function according to the (squared) distance between the scene point and
      its corresponding model point.  Scene points that have a large distance
      can be labeled as outliers and removed from the point cloud, allowing us
      to refine the pose estimate without these distractors.  You will find that
      many ICP implementations, such as <a
      href="http://www.open3d.org/docs/release/python_api/open3d.pipelines.registration.registration_icp.html">this
      one from Open3d</a> include a "maximum correspondence distance" parameter
      for this purpose.</p>

      <p>This leaves us with a "chicken or the egg" problem -- we need a
      reasonable estimate of the pose to detect outliers, but for very messy
      point clouds it may be hard to get a reasonable estimate in the presence
      of those outliers.  So where do we begin?  One common approach that was
      traditionally used with ICP is an algorithm called RANdom SAmple Consensus
      (RANSAC) <elib>Fischler81</elib>.  In RANSAC, we select a number of random
      (typically quite small) subsets of "seed" points, and compute a pose
      estimate for each.  The subset with the largest number of "inliers" --
      points that are below the maximum correspondence distance -- is considered
      the winning set and can be used to start the ICP algorithm.
      </p>

      <todo>RANSAC example</todo>

      <p>There is another important and clever idea to know.  Remember the
      important observation that we can decouple the solution for rotation and
      translation by exploiting the fact that the <i>relative</i> position
      between points depends on the rotation but is invariant to translation.
      We can take that idea one step further and note that the <i>relative
      <b>distance</b></i> between points is actually invariant to both rotation
      <i>and</i> translation.  This has traditionally been used to separate the
      estimation of scale from the estimation of translation and rotation (I've
      decided to ignore scaling throughout this chapter, since it doesn't make
      immediate sense for our proposed problem).  In <elib>Yang20a</elib> they
      proposed that this can also be used to prune outliers even before we have
      an initial pose estimate.  If the distance between a <i>pair</i> of points
      in the scene is unlike any distance between any pair of points in the
      model, then one of those two points is an outlier.  Finding the maximum
      set of inliers by this metric would correspond to finding the maximum
      clique in a graph.  Unfortunately, finding optimal cliques is probably too
      expensive for large point clouds, so in <elib>Yang20a</elib> conservative
      approximations are used instead.</p>

    </subsection>

    <subsection><h1>Point cloud segmentation</h1>
    
      <p>Not all outliers are due to bad returns from the depth camera, very
      often they are due to other objects in the scene.  Even in the most basic
      case, our object of interest will be resting on a table or in a bin.  If
      we know the geometry of the table/bin a priori, then we can subtract away
      points from that region.  Alternatively, we can <a
      href="http://www.open3d.org/docs/release/tutorial/geometry/pointcloud.html#Crop-point-cloud">crop</a>
      the point cloud to a region of interest.  This can be sufficient for very
      simple or uncluttered scenes, and will be just enough for us to accomplish
      our goal for this chapter.</p>

      <p>More generally, if we have multiple objects in the scene, we may still
      want to find the object of interest (mustard bottle?).  If we had truly
      robust point cloud registration algorithms, then one could imagine that
      the optimal registration would allow us to correspond the model points
      with just a subset of the scene points; point cloud registration could
      solve the object detection problem!  Unfortunately, our algorithms aren't
      strong enough.</p>

      <figure><img src="data/labelfusion_multi_object_scenes.png"/>
        <figcaption>A multi-object scene from <a
        href="http://labelfusion.csail.mit.edu/">LabelFusion</a>
        <elib>Marion17</elib>.</figcaption>
      </figure>

      <p>As a result, we almost always use some other algorithm to "segment" the
      scene in a number of possible objects, and run registration independently
      on each segment.  There are numerous geometric approaches to
      segmentation<elib>Nguyen13</elib>.  But these days we tend to use neural
      networks for segmentation, so I won't go into those details here.</p>
      
      <todo> normal estimation, plane fitting, etc.</todo>

    </subsection>

    <subsection><h1>Generalizing correspondence</h1>
    
      <p>The problem of outliers or multiple objects in the scene challenges our
      current approach to correspondences.  So far, we've used the notation $c_i
      = j$ to represent the correspondence of <i>every</i> scene point to a
      model point. But note the asymmetry in our algorithm so far (scene
      $\Rightarrow$ model). I chose this direction because of the motivation of
      partial views from the mustard bottle example from the chapter opening.
      Once we start thinking about tables and other objects in the scene,
      though, maybe we should have done model $\Rightarrow$ scene?  You should
      verify for yourself that this would be a minor change to the ICP
      algorithm.  But what if we have multiple objects <i>and</i> partial
      views?</p>

      <p>It would be simple enough to allow $c_i$ to take a special value for
      "no correspondence", and indeed that helps.  In that case, we would hope
      that, at the optimum, the scene points corresponding to the model of
      interest would be mapped to their respective model points, and the scene
      points from outliers and other objects get labelled as "no
      correspondence".  The model points that are from the occluded parts of the
      object would simply not have any correspondences associated with them.</p>
      
      <p>There is an alternative notation that we can use which is slightly more
      general.  Let's denote the correspondence matrix $C\in \{0,1\}^{N_s \times
      N_m}$, with $C_{ij} = 1$ iff scene point $i$ corresponds with model point
      $j$.  Using this notation, we can write our estimation step (given known
      correspondences) as: $$\min_{X^O\in \mathrm{SE}(3)} \sum_{i=1}^{N_s}
      \sum_{j=1}^{N_m} C_{ij} \|X^O \: {}^Op^{m_j} - p^{s_i}\|^2.$$ This
      subsumes our previous approach: if $c_i = j$ then set $C_{ij} = 1$ and the
      rest of the row equal to zero.  If the scene point $i$ is an outlier, then
      set the entire row to zero.  The previous notation was a more compact
      representation for the true asymmetry in the "closest point" computation
      that we did above, and made it clear that we only needed to compute $N_s$
      distances.  But this more general notation will take us to the next
      round.</p>
    
    </subsection>

    <subsection><h1>Soft correspondences</h1>

      <p>What if we relax our strict binary notion of correspondences, and think
      of them instead as correspondence weights $C_{ij} \in [0,1]$ instead of
      $C_{ij} \in \{0, 1\}$?  This is the main idea behind the "coherent point
      drift" (CPD) algorithm <elib>Myronenko10</elib>.  If you read the CPD
      literature you will be immersed in a probabilistic exposition using a
      Gaussian Mixture Model (GMM) as the representation.  The probabilistic
      interpretation is useful, but don't let it confuse you; trust me that it's
      the same thing.  To maximize the likelihood on a Gaussian, step one is
      almost always to take the $\log$ because it is a monotonic function, which
      gets you right back to our quadratic objective.</p>

      <div>The probabilistic interpretation does give a natural mechanism for
      setting the correspondence weights on each iteration of the algorithm, by
      thinking of them as the probability of the scene points given the model
      points: \begin{equation} C_{ij} = \frac{1}{a_i}
      \exp^{-\frac{1}{2}\left\|\frac{X^O \: {}^Op^{m_j} - p^{s_i}}{\sigma^2}
      \right\|^2},\end{equation} which is the standard density function for a
      Gaussian, $\sigma$ is the and $a_i$ is the proper normalization constant
      to make the probabilities sum to one.  It is also natural to encode the
      probability of an outlier in this formulation (it simply modifies the
      normalization constant).  <details><summary>Click to see the normalization
      constant, but it's really an implementation detail.</summary> The
      normalization works out to be \begin{equation}a_i = \sum_{j=1}^{N_m}
      \exp^{-\frac{1}{2}\left\|\frac{X^O \: {}^Op^{m_j} - p^{s_i}}{\sigma^2}
      \right\|^2} + (2\pi\sigma^2)^\frac{3}{2} \frac{w}{1-w}\frac{N_m}{N_s},
      \end{equation} with $0 \le w \le 1$ the probability of a sample point
      being an outlier <elib>Myronenko10</elib>.</details></div>

      <p>The CPD algorithm is now very similar to ICP, alternating between
      assigning the correspondence weights and updating the pose estimate.  The
      pose estimation step is almost identical to the procedure above, with the
      "central" points now $${^Op^{\bar{m}}} = \frac{1}{N_C} \sum_{i,j} C_{ij}
      \: {}^Op^{m_j}, \quad p^{\bar{s}} = \frac{1}{N_C} \sum_{i,j} C_{ij} p^{s_i},
      \quad N_C = \sum_{i,j} C_{ij},$$ and the data matrix now: $$W = \sum_{i,j}
      C_{ij} \left(p^{s_i} - p^{\bar{s}}\right) \left({}^Op^{m_j} -
      {}^Op^{\bar{m}}\right)^T.$$  The rest of the updates, for extracting $R$
      using SVD and solving for $p$ given $R$, are the same.  You won't see the
      sums explicitly in the code, because each of those steps has a
      particularly nice matrix form if we collect the points into a matrix with
      one point per column.</p>

      <p>The probabilistic interpretation also gives us a strategy for
      determining the covariance of the Gaussians on each iteration.
      <elib>Myronenko10</elib> derives the $\sigma$ estimate as: (TODO: finish
      converting this part of the derivation)<!-- $$\sigma^2 = \frac{1}{3 N_C}
      \sum_{i,j} C_{ij} \left[\left(p^{s_i} - p^{\bar{s}}\right)^T \left(p^{s_i}
      - p^{\bar{s}}\right) +  \right].$$ --> </p>

      <todo>Example with CPD</todo>

      <p>The word on the street is the CPD is considerably more robust than ICP
      with its hard correspondences and quadratic objective; the Gaussian model
      mitigates the effect of outliers by setting their correspondence weight to
      nearly zero.  But it is also more expensive to compute all of the pairwise
      distances for large point clouds.  In <elib>Gao18a</elib>, we point out
      that a small reformulation (thinking of the scene points as inducing a
      distribution over the model points, instead of vice versa) can get us back
      to summing over the scene points only.  It enjoys many of the robustness
      benefits of CPD, but also the performance of algorithms like ICP.</p>

      <todo>Example with FilterReg</todo>

    </subsection>

    <subsection><h1>Nonlinear optimization</h1>

      <p>All of the algorithms we've discussed so far have exploited the SVD
      solution to the pose estimate given correspondences, and alternate between
      estimating the correspondences and estimating the pose.  There is another
      important class of algorithms that attempt to solve for both
      simultaneously.  This makes the optimization problem nonconvex, which
      suggests they will still suffer from local minima like we saw in the
      iterative algorithms.  But many authors argue that the solution times
      using nonlinear solvers can be on par with the iterative algorithms (e.g.
      <elib>Fitzgibbon03</elib>).</p>

      <figure><img style="width:80%" src="data/robust_kernels.png"><figcaption>A
        variety of objective functions for robust registration.  <a
        href="https://www.geogebra.org/calculator/hcgacefe">Click here for the
        interactive version</a>.</figcaption></figure>

      <p>Another major advantage of using nonlinear optimization is that these
      solvers can accommodate a wide variety of objective functions to achieve
      the same sort of robustness to outliers that we saw from CPD.  I've
      plotted a few of the popular objective functions above.  The primary
      characteristic of these functions is that they taper, so that larger
      distances eventually do not result in higher cost.  Importantly, like in
      CPD, this means that we can consider all pairwise distances in our
      objective, because if outliers add a constant value (e.g. 1) to the cost,
      then they have no gradient with respect to the decision variables and no
      impact on the optimal solution.  We can therefore write, for our favorite
      loss function, $l(x)$, an objective of the form, e.g. $$\min \sum_{i,j}
      \left[ l\left(\| X^O \: {}^Op^{m_j} - p^{s_i} \|\right)\right].$$ And what
      are the decision variables?  We can also exploit the additional
      flexibility of the solvers to use minimal parameterizations -- e.g.
      $\theta$ for rotations in 2D, or Euler angles for rotations in 3D.  The
      objective function is more complicated but we can get rid of the
      constraints.</p>

      <example><h1>Nonlinear pose estimation with known correspondences</h1>

        <p>To understand precisely what we are giving up, let's consider a
        warm-up problem where we use nonlinear optimization on the minimal
        parameterizations for the pose estimation problem.  As long as we don't
        add any constraints, we can still separate the solutions for rotation
        from the solutions from translation.  So consider the problem:
        $$\min_\theta \sum_{j} \left\| \begin{bmatrix} \cos\theta & -\sin\theta
        \\ \sin\theta & \cos\theta \end{bmatrix} \left({}^Op^{m_{c_i}} -
        {}^Op^{\bar{m}}\right) - \left( p^{s_i} -
        p^{\bar{s}}\right)\right\|^2.$$ We now have a complex, nonlinear
        objective function.  Have we introduced local minima into the
        problem?</p>

        <p><img
        style="height:100px;float:right;margin-left:10px;margin-right:10px"
        src="data/model_points_circle.png"/> As a thought experiment, consider
        the problem where all of our model points lie on a perfect circle.  For
        any one scene point $i$, in the rotation-only optimization, the worst
        case is when our estimate $\theta$ is 180 degrees ($\pi$ radians) away
        from the optimal solution.  The cost for that point would be $4r^2$,
        where $r$ is the radius of the circle.  In fact, using the law of
        cosines, we can actually write the squared distance for the point for
        any error, $\theta_{err} = \theta - \theta^*,$ as $$\text{distance}^2 =
        r^2 + r^2 - 2 r^2 \cos\theta_{err}.$$  And in the case of the circle,
        every other model point contributes the same cost. This is not a convex
        function, but every minima is a globally optimal solution (just wrapped
        around by $2\pi$.</p>
        
        <p>In fact, even if we have a more complicated, non-circular, geometry,
        then this same argument holds.  Every point will incur and error as it
        rotates around the circle, but may have a different radius.  The error
        for all of the model points will decrease as $\cos\theta_{err}$ goes to
        one.  So every minima is a globally optimal solution.  We haven't
        actually introduced local minima (yet)!</p>

        <p>It is the correspondences, and other constraints that we might add to
        the problem, that really introduce local minima.</p>

      </example>
<!--
      <example><h1>Nonlinear optimization for pose estimation</h1>

      </example>
-->
      <subsubsection><h1>Precomputing distance functions</h1>
      
        <p>There is an exceptionally nice trick for speeding up the computation
        in these nonlinear optimizations, but it does require us to go back to
        thinking about the minimum distance from a scene point to the model, as
        opposed to the pairwise distance between points.  This, I think, it not
        a big sacrifice now that we have stopped enumerating correspondences
        explicitly.  Let's slightly modify our objective above to be of the form
        $$\min \sum_{i} \left[ l\left( \min_j \| X^O \: {}^Op^{m_j} - p^{s_i}
        \|\right)\right].$$  In words, we can an apply arbitrary robust loss function, but we will only apply it to the minimum distance between the scene point and the model.</p>

        <p>The nested $\min$ functions look a little intimidating from a
        computational perspective.  But this form, coupled with the fact that in
        our application the model points are fixed, actually enables a fantastic
        optimization.  First, let's move our optimization into the model frame
        by changing the inner term to $$\min_j \|{}^Op^{m_j} - {}^OX^W
        p^{s_i}\|.$$  Now realize that for any 3D point $x$, we can precompute
        the minimum distance from $x$ to any point on our model, call it
        $\phi(x).$  The term above is just $\phi({^OX^Wp^{s_i}}).$ This function
        of 3D space, sometimes called a <a
        href="https://en.wikipedia.org/wiki/Distance_transform">distance
        field</a>, and the closely related signed distance function (SDF) and <a
        href="https://en.wikipedia.org/wiki/Level-set_method">level sets</a> are
        a common representation in geometric modeling.  And they are precisely
        what we need to quickly compute the cost from many scene points.</p>

        <figure><img width="48%" src="data/rectangle_points_distance.svg"/><img
        width="48%"
        src="data/rectangle_signed_distance.svg"/><figcaption>Contour plot of
        the distance function for our point-based representation of the rectange
        (left), and the <i>signed</i> distance function for our "mesh-based" 2D
        rectangle.  The smoother distance contours in the mesh-based
        representation can help alleviate some of the local minima problems we
        observed above.</figcaption></figure>

        <figure><img width="60%" src="data/deep_sdf_bunny.png"><figcaption>A
        visualization of a "signed distance function" (SDF) representation of 3D
        geometry.  Reproduced with permissions from
        <elib>Park19</elib>.</figcaption></figure>
      
        <p>As the figure I've included above suggests, we can use the
        precomputed distance functions to represent the minimum distance to a
        mesh model instead of just a point cloud.  And with some care, we can
        define alternative distance-like functions in 3D space that have smooth
        subgradients which can work better with nonlinear optimization.  There
        is a rich literature on this topic; see <elib>Osher03</elib> for a nice
        reference.</p>

      </subsubsection>

    </subsection>


    <subsection><h1>Global optimization</h1>

      <p>Is there any hope of exploiting the beautiful structure that we found
      in the pose estimation with known correspondences with an algorithm that
      searches for correspondences and pose simultaneously?</p>

      <p>There are some algorithms that claim global optimality for the ICP
      objective, typically using a branch and bound strategy
      <elib>Gelfand05+Mellado14+Yang15</elib>, but we have been disappointed
      with the performance of these on real point clouds.  We proposed another,
      using branch and bound to explicitly enumerate the correspondences with a
      truncated least-squares objective, and a tight relaxation of the SE(3)
      constraints<elib>Izatt17b</elib>, but this method is still limited to
      relatively small problems.</p>

      <p>In recent work, TEASER <elib>Yang20a</elib> takes a different approach,
      using the observation that truncated least squares can be optimized
      efficiently for the case of scale and translation.  For rotations, they
      solve a semi-definite relaxation of the truncated least-squares objective.
      This relaxation is not guaranteed to be tight, but (like all of the convex
      relaxations we describe in this chapter), it is easy to certify after the
      fact whether the optimal solution satisfied the original constraints.</p>

    </subsection>
    
  </section>

  <section><h1>Non-penetration and "free-space" constraints</h1>
  
    <p>We've explored two main lineages of algorithms for the pose estimation --
    one based on the beautiful SVD solutions and the other based on nonlinear
    optimization.  As we will see, non-penetration and free-space constraints
    are, in most cases, non-convex constraints, so are a better match for the
    nonlinear optimization approach.  But there are some examples of convex
    non-penetration constraints (e.g., when points must not penetrate a
    half-plane) and it <i>is</i> possible to include these in our convex
    optimization approach.  I'll just show it in a simple example.</p>

    <example><h1>Non-penetration of a half-plane via convex optimization</h1>
    
      <p>For simplicity, I will restrict this example to 2D, where we can
      leverage our convenient form of the rotation matrices: $$R =
      \begin{bmatrix} a & -b \\ b & a \end{bmatrix}.$$  Recall that the rotation
      matrix constraints for this form reduce to $a^2+b^2=1$.  These are
      non-convex constraints, but we can relax them to $a^2 + b^2 \le 1.$  This
      is almost exactly the problem we visualized in <a
      href="#2D_rotation_only">this earlier example</a>, except that we will add
      translations here.  The relaxation is exactly changing the non-convex unit
      circle constraint into the convex unit disk constraint.  Based on that
      visualization, we have reason to be optimistic that the unit disk
      relaxation could be tight.</p>

      <p>Let's solve the following optimization: \begin{align*} \min_{p,a,b}
      \quad & \sum_i \| p + R \: {}^Op^{m_{c_i}} - p^{s_i} \|^2, \\ \subjto
      \quad & a^2 + b^2 \le 1, \\ & {}^W p^{m_i} \ge 0, \quad \forall i \in [1,
      N_m], \end{align*} where $R$ depends on $a, b$ as above. Just as the
      addition of constraints forced us to move from the pseudo-inverse solution
      to a numerical optimization-based solution in the last chapter, the
      solution to this problem is no longer given simply by SVD.  As formulated,
      this optimization falls under the category of a Second-Order Cone Program
      (SOCP), though we need to use a slack variable to put it into the standard
      form with a linear objective.</p>

      <p><a target="pose"
        href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/pose.ipynb#scrollTo=pO8WTJjEWJ0k"><img
        src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
        In Colab"/></a></p>
      
      <p>Please be careful.  Now that we have a constraint that depends on $p$,
      our original approach to solving for $R$ independently is no longer valid
      (at least not without modification).  I've provided a simple implementation in colab.</p>
      
      <!--To add the objective efficiently in
      code, we'll use properties of the matrix trace and $R^TR=I$ to observe
      that \begin{equation} \sum_i \| p + R \: {}^Op^{m_{c_i}} - p^{s_i} \|^2 =
      \| P + R M - S\|_F^2,\end{equation} where $P$, $M$, and $S$, are 2xN
      matrices formed by horizontally stacking $p$, ${}^Op^{m_{c_i}}$, and
      $p^{s_i}$, respectively.-->

      <figure style="margin-top:-20px"><img style="width:70%"
      src="data/constrained_convex_pose.svg"/><figcaption>The convex
      approximation to the constrained pose estimation problem with known
      correspondences.  I have added constraints that both the $x$ and $y$
      positions of all of the points in the estimated pose must be greater than
      $0$ in the world frame (as illustrated by the green
      lines).</figcaption></figure>
      
      <p>This is a slightly exaggerated case, where the scene points are really
      pulling the box into the constraints.  We can see that the relaxation is
      <i>not</i> tight in this situation; the box is being shrunk slightly to
      explain the data.</p>
      
      <p>Constrained optimizations like this can be made relatively efficient,
      and are suitable for use in an ICP loop for modest-sized point clouds. But
      the major limitation is that we can only take this approach for convex
      non-penetration constraints (or convex approximations of the constraints),
      like the "half-plane" constraints we used here.  It is probably not very
      suitable for non-penetration between two objects in the scene.</p>

      <p>For the record, the fact that I was able to easily add this example
      here is actually pretty special.  I don't know of another toolbox that
      brings together the advanced optimization algorithms with the geometry /
      physics / dynamical systems in the way that Drake does.</p>

    </example>

    <subsection><h1>Free space constraints as non-penetration constraints</h1>

      <p>There is another beautiful idea that I first saw in
      <elib>Schmidt15</elib>...</p>
    </p>

      <figure><img width="60%" src="data/free_space.svg"/><figcaption>The rays
      cast from the camera to the depth returns define a "free space obstacle"
      that other objects should not penetrate.</figcaption></figure>
    
    </subsection>
  
    <p>Ultimately, as much as I prefer the convex formulations with their
    potential for deeper understanding (by us) and for stronger guarantees (for
    the algorithms), the ability to add non-penetration constraints and
    free-space constraints is simply too valuable to ignore.  I hope that, if
    you come back to these notes in a year or two, I will be able to report that
    we have strong results for these more complex formulations.  But for now, in
    most applications, I will steer you towards the nonlinear optimization
    approaches and taking advantage of these constraints.</p>

  </section>

  <section><h1>Putting it all together</h1>
  
  </section>
  
  <section><h1>Looking ahead</h1>
  
    <p>Although I very much appreciate the ability to think rigorously about
    these geometric approaches, we need to think critically about whether we are
    solving for the right objectives.</p>
    
    <p>Even if we stay in the realm of pure geometry, it is not clear that a
    least-squares objective with equal weights on the points is correct. Imagine
    a tall skinny book laying flat on the table -- we might only get a very
    small number of returns from the edges of the book, but those returns
    contain proportionally much more information than the slew of returns we
    might get from the book cover.  It is no problem to include relative weights
    in the estimation objectives we have formulated here, but we don't yet have
    very successful geometry-based algorithms for deciding what those weights
    should be in any general way.  (There has been a lot of research in this
    direction, but it's a hard problem.)</p>

    <p>Please realize, though, that as beautiful as geometry is, we are so far
    all but ignoring the most important information that we have: the color
    values!  While it is possible to put color and other features into an
    ICP-style pipeline, it is very hard to write a simple distance metric in
    color space (the raw color values for a single object might be very
    different in different lighting conditions, and the color values of
    different objects can look very similar).  Advances in computer vision,
    especially those based on deep learning over the last few years, have
    brought new life to this problem.  When I asked my students recently "If you
    had to give up one of the channels, either depth or color, which would you
    give up?" the answer was a resounding "I'd give up depth; don't take away my
    color!"  That's a big change from just a few years ago.</p>

    <p>But deep learning and geometry can (should?) work nicely together. For
    example, identifying correspondences between point clouds has been a major
    theme in this chapter -- and we haven't completely solved it.  This is one
    of the many places where the color values and deep learning have a lot to
    offer <elib>Florence18a</elib>.  We'll turn our attention to them soon!</p>

  </section>
  
  <section id="exercises"><h1>Exercises</h1>
  
    <exercise><h1>How many points do you need?</h1>
    
      <p>Consider the problem of point cloud registration with known
      correspondences.  I said that, in most cases, we have far more points than
      we have decision variables, and therefore treating each as an equality
      constraint would make the problem over-constrained.  That raises a natural
      question: what is the minimal number of points required to uniquely
      specify a pose?  In 2D?  In 3D?  (The answer might not be what you might
      first think!)</p>

      <!-- 2 points is sufficient in 2D.  3 in 3D.  The disconnect to our
      intuition (e.g. how many points required to specify a plane), is because
      these points have unique IDs from the correspondences.  And reflections
      aren't allowed.  -->
    
    </exercise>

    <exercise><h1>Rotational symmetries</h1>

      <p>We have a beautiful formulation for the point cloud registration
      problem with known correspondences -- it has a quadratic objective and
      (with the determinant constraint relaxed) a quadratic equality constraint.
      Sometimes we have objects that are rotationally symmetric -- think of box
      that is a perfect cube.  How can the quadratic objective capture a problem
      where there should be equally good solutions at rotations that are 90
      degrees apart?</p>

      <!-- Answer:  It can't!  And this can't happen in our formulation, because
      there is no such thing as rotational symmetry when we've assumed known and
      one-to-one correspondences. -->

    </exercise>

    <exercise><h1>ICP with random initializations</h1>

      <p>If you run the ICP example code which has randomized object geometry
      and ground truth object pose, you can't help but notice that the algorithm
      does a quite reasonable job of estimating translation, but (at least for
      the geometries I've generated here) does a fairly lousy job with rotation.
      One mitigation is to run ICP multiple times with different initial
      estimates of the rotation.  This is a reasonable strategy for any
      nonconvex optimization problem with local minima, but is particularly
      useful here since even when the point clouds are quite complex, the
      dimensionality of the search space is low.  In particular, we can generate
      a reasonable coverage of 2D or even 3D rotations with a modest number of
      samples (for 3D, consider using
      <code>UniformlyRandomRotationMatrix</code>).  Furthermore, running ICP
      from multiple initial conditions can be done in parallel.</p>

      <p>Try implementing ICP from multiple initial estimates $\hat{X}^O$, sampling only in rotation.  If you only keep the best one (lowest estimation error), then how much does it improve the performance over a single ICP run?</p>

    </exercise>

    <exercise><h1>Planar Two-Point ICP</h1>
      <p>We saw that ICP has many local minimas. But what are these local minimas, and can we say something about how wrong our initial guess has to be until we arrive at one of these local minima? Although the analysis can get complicated for general geometry, let's start by analyzing a simple example of a planar two-point ICP.</p> 

      <figure><img style="width:90%", src="figures/exercises/icp_correspondence.png"/>
      </figure>

        
      <p>The model points and the scene points are given as:</p>

      $$p^{m_1}=\begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad   p^{m_2}=\begin{pmatrix} -1 \\ 0 \end{pmatrix} \qquad p^{s_1}=\begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad p^{s_2}=\begin{pmatrix} -1 \\ 0 \end{pmatrix}.$$

      <p>The true correspondence is given by their numbering, but note that we don't know the true correspondence - ICP simply determines it based on nearest neighbors. Given our vanilla-ICP cost (sum of pairwise distances squared), we can parametrize the 2D pose $X^O$ with $p_x,p_y,a,b$ and write down the resulting optimization as:</p>
      $$\begin{aligned}
        \min_{p_x,p_y,a,b} \quad & \sum_i \bigg\| \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} a & -b \\ b & a \end{pmatrix} p^{m_i} - p^{s_i} \bigg\|^2 \\
        \text{s.t.} \quad & a^2 + b^2 = 1. \end{aligned}$$

      <ol type="a">
        <li> When the initial guess for the pose results in the correct correspondences based on nearest-neighbors, show that the ICP cost is minimum when $p_x,p_y,b=0$ and  $a=1$. Describe the set of initial poses that results in convergence to the true solution. </li>
        <li> When the initial guess results in a flipped correspondence, show that ICP cost is minimum when $p_x,p_y,b=0$ and $a=-1$. Describe the set of initial poses that results in convergence to this incorrect solution.</li>
        <li> Remember that correspondence need not be one-to-one. In fact they are often not when computed based on nearest-neighbors. By constructing the data matrix $W$, show that when both scene points correspond to one model point, ICP gets stuck and does not achieve zero-cost. (HINT: You may assume that doing SVD on a zero matrix leads to identity matrices for $U$ and $V$).</li>
      </ol>
    </exercise>

    <exercise><h1>Bunny ICP</h1>
      <p>For this exercise, you will implement the ICP algorithm to match pointclouds of two Stanford bunnies. You will work exclusively in <a href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/exercises/pose/bunny_icp.ipynb" target="_blank">this notebook</a>. You will be asked to complete the following steps:</p>
      <ol type="a">
        <li> Implement a method to compute the least-squares transform given the correspondences. </li>
        <li> Implement the ICP algorithm using the least square transform method from part a. </li>
      </ol>    
    </exercise>

    <exercise><h1>RANSAC</h1>
      <p>For this exercise, you will remove the environmental background from the Stanford bunny pointcloud using the RANSAC algorithm. You will work exclusively in <a href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/exercises/pose/ransac.ipynb" target="_blank">this notebook</a>. You will be asked to complete the following steps:</p>
      <ol type="a">
        <li> Implement the RANSAC algorithm. </li>
        <li> Use the RANSAC algorithm to remove the planar surface from the scene point cloud.</li>
      </ol>    
    </exercise>

    <exercise><h1>Pose Estimation</h1>
      <p>Let's go back to the example from Chapter 3, but relax the assumption that we have the pose of the red foam brick. You will be given a simulated raw pointcloud from our previous setup from a depth camera. Your task is to perform segmentation on this raw pointcloud data, and perform ICP to estimate the pose of the brick. You will work exclusively in <a href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/exercises/pose/pose_estimation_icp.ipynb" target="_blank">this notebook</a>. You will be asked to complete the following steps:</p>
      <ol type="a">
        <li> Perform segmentation on the raw pointcloud to remove the background. </li>
        <li> Tweak an existing ICP implementation to correctly estimate the pose of the red foam brick.</li>
      </ol>    
    </exercise>

  </section>

</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->

<table style="width:100%;"><tr style="width:100%">
  <td style="width:33%;text-align:left;"><a class="previous_chapter" href=pick.html>Previous Chapter</a></td>
  <td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
  <td style="width:33%;text-align:right;"><a class="next_chapter" href=clutter.html>Next Chapter</a></td>
</tr></table>

<div id="footer">
  <hr>
  <table style="width:100%;">
    <tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">&copy; Russ
      Tedrake, 2020</td></tr>
  </table>
</div>


</body>
</html>
